Skip to content

XPath nodeset always in document-order#320

Merged
naitoh merged 2 commits into
ruby:masterfrom
tompng:no_reverse_order
Jun 7, 2026
Merged

XPath nodeset always in document-order#320
naitoh merged 2 commits into
ruby:masterfrom
tompng:no_reverse_order

Conversation

@tompng
Copy link
Copy Markdown
Member

@tompng tompng commented May 26, 2026

Fixes problems described in #319
Change nodeset ordering to match XPath 1.0 spec
Changes XPath#first XPath#match ordering

In XPath 1.0 spec, nodeset are unordered set, but many operations are required to use document ordered.
Functions such as local-name should use first node in document order.
Filter expressions such as (preceding::foo)[1] need to use document-order.
JavaScript's XPathResult.ORDERED_NODE_ITERATOR_TYPE XPathResult.FIRST_ORDERED_NODE_TYPE always orders in document order.
This commit will make REXML match to this ordering, and also fixes inconsistent ordering bug: same xpath may return document-ordered nodes or reverse-document-ordered nodes depending on nodesets.size.

Copilot AI review requested due to automatic review settings May 26, 2026 12:49
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Aligns REXML XPath node-set results with XPath 1.0 “document order” requirements and fixes inconsistent ordering behavior across queries (notably for first/match and reverse axes like preceding*).

Changes:

  • Update XPath parser step processing to always return document-ordered node-sets (remove reverse ordering and unify ordering logic).
  • Adjust existing preceding / preceding-sibling expectations in tests to match document order semantics.
  • Add a new regression test to ensure ordering is consistent regardless of matched nodeset size.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
test/xpath/test_base.rb Updates ordering-related assertions and adds a regression test for consistent document ordering.
test/xpath/test_axis_preceding_sibling.rb Updates expected preceding-sibling ordering to document order.
lib/rexml/xpath_parser.rb Removes reverse-order option and enforces document-order sorting for step results.
Comments suppressed due to low confidence (1)

lib/rexml/xpath_parser.rb:1

  • This change removes the previous nodesets.size == 1 fast-path, so a single large nodeset now always pays the de-dupe + index-sort cost. If this code is performance-sensitive, consider reintroducing a safe optimization: when nodesets.size == 1 and the yielded nodeset is already known to be in document order (e.g., from a forward axis), skip the expensive index sort (while still sorting whenever nodesets.size > 1 or the axis is reverse/known non-document-ordered).
# frozen_string_literal: false

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/xpath/test_base.rb Outdated
Comment thread lib/rexml/xpath_parser.rb
Comment on lines 664 to 666
new_arry = []
array_of_nodes.each { |node|
node_idx = []
@tompng tompng mentioned this pull request May 26, 2026
Copilot AI review requested due to automatic review settings May 28, 2026 03:46
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

lib/rexml/xpath_parser.rb:1

  • With step now always funneling through sort(raw_nodes), sort will be exercised more broadly. The ordering logic assumes node.parent.index(node) is always a valid numeric index; that’s not true for node types that are not children in their parent’s children list (notably attributes, and potentially namespace nodes, depending on REXML’s node model). In those cases index(node) can be nil, which can make the sort key contain nil and raise during comparisons. Consider explicitly handling non-child node types when computing the sort key (e.g., derive an order relative to the owning element, or fall back to a stable tie-breaker) so sort is safe for all node kinds that XPath can return.
# frozen_string_literal: false

Comment thread lib/rexml/xpath_parser.rb Outdated
Comment on lines +462 to +474

seen = {}.compare_by_identity
raw_nodes = []
nodesets.each do |nodeset|
nodeset.each do |node|
raw_node = node.respond_to?(:raw_node) ? node.raw_node : node
next if seen.key?(raw_node)
seen[raw_node] = true
raw_nodes << raw_node
end
ordered_nodeset = sort(raw_nodes, order)
end
ordered_nodeset = sort(raw_nodes)

Comment thread lib/rexml/xpath_parser.rb
end
if nodesets.size == 1
ordered_nodeset = nodesets[0]
else
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tompng
Thank you for this PR.

But, performance has decreased because the optimization for nodesets.size == 1 has been removed.
Is there any way to improve this?

$ git checkout master
$ rake build
rexml 3.4.5 built to pkg/rexml-3.4.5.gem.
$ gem install pkg/rexml-3.4.5.gem
$ git checkout no_reverse_order
$ benchmark-driver benchmark/xpath_no_reverse_order.yaml
Calculating -------------------------------------
                     master(a98066c)       after  master(a98066c,YJIT)  after(YJIT) 
               child         280.227      19.016               544.496       31.084 i/s -     100.000 times in 0.356853s 5.258603s 0.183656s 3.217105s
          descendant          18.445       9.682                28.523       15.258 i/s -     100.000 times in 5.421532s 10.327928s 3.505883s 6.554030s

Comparison:
                            child
master(a98066c,YJIT):       544.5 i/s 
     master(a98066c):       280.2 i/s - 1.94x  slower
         after(YJIT):        31.1 i/s - 17.52x  slower
               after:        19.0 i/s - 28.63x  slower

                       descendant
master(a98066c,YJIT):        28.5 i/s 
     master(a98066c):        18.4 i/s - 1.55x  slower
         after(YJIT):        15.3 i/s - 1.87x  slower
               after:         9.7 i/s - 2.95x  slower
  • benchmark/xpath_no_reverse_order.yaml
loop_count: 100
contexts:
  - name: master(a98066c)
    gems:
      rexml: 3.4.5
    require: false
    prelude: require 'rexml'
  - name: after
    prelude: |
      $LOAD_PATH.unshift(File.expand_path("lib"))
      require 'rexml'
  - name: master(a98066c,YJIT)
    gems:
      rexml: 3.4.5
    require: false
    prelude: |
      require 'rexml'
      RubyVM::YJIT.enable
  - name: after(YJIT)
    prelude: |
      $LOAD_PATH.unshift(File.expand_path("lib"))
      require 'rexml'
      RubyVM::YJIT.enable

prelude: |
  require "rexml"
  xml = "<root>" + (1..2000).map { |i| "<item id='#{i}'/>" }.join + "</root>"
  doc = REXML::Document.new(xml)

benchmark:
  child: REXML::XPath.match(doc, "/root/item")
  descendant: REXML::XPath.match(doc, "//item")

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to improve this?

Yes, there are several ways.

  • A. Improve performance of sort

    • Should be done in a separate PR
  • B. Mark nodesets ordering

    • ordered_nodeset = marked_as_ordered? ? nodesets.first : nodesets.first.reverse
    • Ordering is currently :ordered | :reversed but in Optimize XPath step #315 we need to add :unordered
    • Conflicts with C
  • C. Delay nodeset ordering on demand

    • Sort the final result and in some functions such number(xpath)
    • Requires predicate position-independency analytics, requires sorted nodeset or not
    • After doing this, optimization of B needs to be extended. We need to re-implement B in a different way again.

I implemented B with changing the meaning of order: keyword parameter

  • Before: key to specify the output order: :forward | :reverse
  • After: key to specify the state of nodesets ordering: :ordered | :reversed | :unordered(future)

Perhaps this approach will be a blocker implementing C(delayed nodeset ordering), and optimizing XPathParser#step (#315) but we can consider it later.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented B with changing the meaning of order: keyword parameter

* Before: key to specify the output order: `:forward | :reverse`

* After: key to specify the state of nodesets ordering: `:ordered | :reversed | :unordered(future)`

Perhaps this approach will be a blocker implementing C(delayed nodeset ordering), and optimizing XPathParser#step (#315) but we can consider it later.

@tompng

Thank you.
I also think B is the best option, but can we keep :forward | :reverse in this PR?

I don't want to include :ordered | :reversed | :unordered(future) in this PR, because I can't review them at this time, as they are scheduled to be changed later.

I would like to include changes to :ordered | :reversed | :unordered(future) in the pull request when they are actually implemented.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Copilot suggested to rename the keyword parameter to :axis_order, what do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your update.

Also, :axis_order is easier to understand.
Please change it to :axis_order.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@tompng tompng force-pushed the no_reverse_order branch 2 times, most recently from 1f46aef to 458b347 Compare June 7, 2026 12:59
Copilot AI review requested due to automatic review settings June 7, 2026 12:59
In XPath 1.0 spec, nodeset are unordered set, but many operations are required to use document ordered.
Functions such as local-name should use first node in document order.
Filter expressions such as `(preceding::foo)[1]` need to use document-order.
JavaScript's `XPathResult.ORDERED_NODE_ITERATOR_TYPE` `XPathResult.FIRST_ORDERED_NODE_TYPE` always orders in document order.
This commit will make REXML match to this ordering, and also fixes inconsistent ordering bug: same xpath may return document-ordered nodes or reverse-document-ordered nodes depending on `nodesets.size`.
@tompng tompng force-pushed the no_reverse_order branch from 458b347 to be3de1f Compare June 7, 2026 13:01
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread lib/rexml/xpath_parser.rb Outdated
Comment on lines +472 to +474
end
if nodesets.size == 1
ordered_nodeset = nodesets[0]
ordered_nodeset = order == :forward ? nodesets.first : nodesets.first.reverse
Comment thread lib/rexml/xpath_parser.rb
end
end
ordered_nodeset = sort(raw_nodes, order)
ordered_nodeset = sort(raw_nodes)
Comment thread lib/rexml/xpath_parser.rb
# I wouldn't have to do this. Maybe add a document IDX for each node?
# Problems with mutable documents. Or, rewrite everything.
def sort(array_of_nodes, order)
def sort(array_of_nodes)
Copilot AI review requested due to automatic review settings June 7, 2026 13:55
Copy link
Copy Markdown
Contributor

@naitoh naitoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@naitoh naitoh merged commit 7396e9a into ruby:master Jun 7, 2026
72 of 73 checks passed
@tompng tompng review requested due to automatic review settings June 7, 2026 14:18
@tompng tompng deleted the no_reverse_order branch June 7, 2026 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants